Abstract - Image based reconstruction of urban environments
is a challenging problem that deals with the optimization
of a large number of variables, and has several sources of error
such as the presence of dynamic objects. Since most large scale
approaches assume that the observed scene is static,
dynamic objects are relegated to the noise modeling section
of such systems. This is an approach of convenience, since the
RANSAC based framework used to compute most multiview
geometric quantities for static scenes naturally confines dynamic
objects to the class of outlier measurements. However,
reconstructing dynamic objects along with the static environment
helps us get a complete picture of an urban environment. Such
understanding can then be used for important robotic tasks like
path planning for autonomous navigation, obstacle tracking and
avoidance, and other areas.

In this paper, we propose a system for robust SLAM that 
works in both static and dynamic environments. To overcome 
the challenge of dynamic objects in the scene, we propose 
a new model to incorporate semantic constraints into the 
reconstruction algorithm. While some of these constraints are 
based on multi-layered dense CRFs trained over appearance as 
well as motion cues, other proposed constraints can be expressed 
as additional terms in the bundle adjustment optimization 
process that does iterative refinement of 3D structure and 
camera / object motion trajectories. We show results on the
challenging KITTI urban dataset for accuracy of motion
segmentation and reconstruction of the trajectory and shape
of moving objects relative to ground truth. For moving object
trajectory reconstruction, we reduce the average relative error
by 41.58% on one sequence relative to state-of-the-art methods
like VISO2 [16], and by 13.89% relative to standard bundle
adjustment algorithms.

1. INTRODUCTION 

Vision based SLAM (vSLAM) is becoming an increasingly
widely researched problem, partly because of its ability
to produce good quality reconstructions with affordable
hardware, and partly because increasing computational
power makes huge optimization problems computationally
affordable. While vSLAM systems are maturing and
getting progressively more complicated, the two main components
remain camera localization (or camera pose estimation) and
3D reconstruction. Generally, these two components precede
an optimization based joint refinement of both camera pose
and 3D structure, called bundle adjustment.

In urban environments, vSLAM is challenging particularly
because of the presence of dynamic objects. Indeed, it
is difficult to capture videos of a city without observing
moving objects like cars or people. However, dynamic objects
are a source of error in vSLAM systems, since the
basic components of such algorithms make the fundamental
assumption that the world being observed is static. While
optimization algorithms are designed to handle random noise
in observations, dynamic objects are a source of structured
noise since they do not conform to models of random noise
distributions (like Gaussian distributions, for example). To
overcome such difficulties, RANSAC based procedures for
camera pose estimation and 3D reconstruction have been
developed in the past, which treat dynamic objects as outliers
and remove them from the reconstruction process.

Fig. 1: Overview of our approach. Top left: a frame from the
highway sequence of the KITTI dataset, used as input.
Bottom left: semantic motion segmentation, which provides both motion
and semantic understanding of the scene. Right: 3D reconstruction
of the semantic map on the highway sequence with the trajectories
of the moving objects overlaid. (Best viewed in color.)

While successful attempts have been made to isolate and
discard dynamic objects from such reconstruction processes,
there are several recent applications that benefit from
reconstructions of such objects. For example, reconstructing
dynamic urban traffic scenes is useful since traffic patterns
can be studied to produce autonomous vehicles that can better
navigate such situations. Reconstructing dynamic objects
is also useful in indoor environments when robots need to
identify and avoid moving obstacles in their path [5].

Reconstructing dynamic objects in videos presents several
challenges. Firstly, moving objects in images and videos
have to be segmented and isolated before they can be
reconstructed. This in itself is a challenging problem in the
presence of image noise and scene clutter. Degeneracies in
camera motion also prevent accurate motion segmentation of
such objects. Secondly, upon isolation, a separate vSLAM
procedure must be initialized for each moving object, since
objects like cars often move independently of each other and
thus have to be treated as such. Often moving objects like
cars occupy only a small portion of the image space in a
video (Figure 1), because of which dense reconstructions are
infeasible, since getting long, accurate feature correspondence
tracks for such objects is difficult. The absence of a large number
of feature correspondences also hinders accurate estimation
of the car's pose with respect to a world coordinate system.
Finally, such objects cannot be reconstructed in isolation
from the static scene, since optimization algorithms like
bundle adjustment do not preserve contextual information
like the fact that the car must move along a direction
perpendicular to the normal of the road surface.

Fig. 2: Illustration of the proposed method. The system takes a sequence of rectified stereo images (A). Our formulation computes
the semantic motion segmentation (D) using the depth (B) and optical flow (C) information. We segment the moving objects (E) from the
stationary background (F). We compute accurate structure of the static background (J) and the moving object (H) with the help of bundle
adjustment (G). This leads to state-of-the-art 3D reconstruction of the dynamic environment (K) with the help of moving object trajectory
estimation (I). (Best viewed in color.)

In this paper, we look at the problem of dynamic scene
reconstruction. We present an end-to-end system that takes
a video, segments the scene into static and dynamic
components and reconstructs both static and dynamic objects
separately. Additionally, while reconstructing the dynamic
object, we impose several novel constraints on the bundle
adjustment refinement that deal with noisy feature
correspondences, erroneous object pose estimation, and
contextual information. To be precise, we propose the following
contributions in this paper:

• We use a new semantic motion segmentation algorithm 
using multi-layer dense CRF which provides state-of- 
the-art motion segmentation and object class labelling. 

• We incorporate semantic contextual information like 
support relations between the road surface and object 
motion, which helps better localize the moving object’s 
pose vis-a-vis the world coordinate system, and also 
helps in reconstructing them. 

• We describe a random sampling strategy that enables us 
to maintain the feasibility of the optimization problem 
in spite of the addition of a large number of variables. 


We evaluate our system on 4 challenging KITTI urban
tracking sequences captured using a stereo camera. For one
sequence, we reduce the average relative error by 41.58%
relative to VISO2 [16], measured as the root mean square
error (RMSE) of the Absolute Trajectory Error (ATE), and we
get an improvement of 13.89% relative to traditional bundle
adjustment.

This paper is organized as follows. We cover related work
in Section II and present a system overview in Section III.
We describe the process of motion segmentation using object
class semantic constraints in Section IV. We track and
initialize multiple moving bodies, which we then optimize using
a novel bundle adjustment, in Section V. Finally, we show
experimental results on challenging datasets in Section VI
and conclude in Section VII.


II. RELATED WORK 

Our system involves several components like semantic 
motion segmentation, dynamic body reconstruction using 
multibody SLAM, and trajectory optimization. We focus on 
each one of our components and draw references to relevant 
works in the literature in this section. 


A. Dynamic body reconstruction 

Dynamic body reconstruction is a relatively new
development in 3D reconstruction with sparse literature on
it. The few solutions in the literature can be categorized
into decoupled and joint approaches. Joint approaches like
[6] use monocular cameras to jointly estimate the depth
maps and perform motion segmentation and motion estimation of
multiple bodies. Decoupled approaches like [7] [8] have
a sequential pipeline where they segment motion and
independently reconstruct the moving and static scenes. Our
approach is a decoupled approach but differs from
other approaches in that we use a novel algorithm for semantic
motion segmentation, which is leveraged to obtain accurate
localization of the moving objects through smoothness and
planar constraints to give an accurate semantic map.

B. Semantic motion segmentation 

Semantics have been used extensively for reconstruction
[1] [4] [3] but had not been exploited in motion segmentation
until recently [17]. Generally, motion segmentation has been
approached using geometric constraints [8] or by using affine
trajectory clustering into subspaces [9]. In our approach we
use motion along with semantic cues to segment the scene
into static and dynamic objects, which allows us to work
with fast moving cars, occlusions and disparity failure. We
show a typical result of the motion segmentation algorithm
in Figure 1 (bottom left), where each variable is labelled with
both a multi-valued semantic class and a binary motion class.

C. Multi-body vSLAM 

In dynamic scenes, decoupled approaches have motion
segmentation followed by tracking each independently moving
object to perform vSLAM. Traditional SLAM approaches
with a single motion model fail in such cases, as moving
bodies cause reconstruction errors. Our approach employs the
Multi Body vSLAM framework [8], where we propose a
novel trajectory optimization with semantic constraints to
show dense reconstruction results of moving objects.

D. Semantic constraints for reconstruction 

Recent approaches to 3D reconstruction have either used
semantic information in a qualitative manner [1], or have
only proposed to reconstruct indoor scenes using such
information [5]. Only Yuan et al. [7] propose to add semantic
constraints for reconstruction. While our approach is similar
to theirs, they use strict constraints for motion segmentation
without regard to appearance information, whereas our
approach works for more general scenarios as it employs a
more powerful inference engine in the CRF.


III. SYSTEM OVERVIEW 

We give an illustration of our system in Figure 2. Given
rectified input images from a stereo camera, we first compute
low level features like SIFT descriptors, optical flow
(using DeepFlow [14]) and stereo [19]. These are then
used to compute semantic motion segmentation, as explained
in Section IV. Once semantic segmentation is done per
image, we isolate stationary objects from moving objects
and reconstruct them independently. To do this, we connect
moving objects across frames into tracks by computing SIFT
matches on dense SIFT features [21]. Then we perform
camera resectioning using EPnP [25] for stationary objects and ICP
for moving objects, to register their 3D points across frames.
This is then followed by bundle adjustment with semantic
constraints (Section V), where we make use of the semantic
and motion labels assigned to the segmented scene to obtain
accurate 3D reconstruction. We then fuse the stationary and
moving object reconstructions using an algorithm based on
the truncated signed distance function (TSDF) [18]. Finally,
we transfer labels from 2D images to 3D data by projecting
3D data onto the images, and using a winner-takes-all
approach to assign labels to 3D data from the labels of the
projected points.
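
The winner-takes-all label transfer described above can be sketched as a simple majority vote. This is a minimal illustration, not the implementation used in the system: the function name `transfer_labels` and the input layout (one list of 2D labels per 3D point, gathered from the images in which the point projects) are assumptions made for the example.

```python
from collections import Counter


def transfer_labels(votes_per_point):
    """Give each 3D point the most frequent label among its 2D projections.

    votes_per_point maps a (hypothetical) point id to the list of class
    labels read at the pixels where that point projects, one entry per
    image that sees the point. Points without any vote stay unlabelled.
    """
    labels = {}
    for point_id, votes in votes_per_point.items():
        if not votes:
            labels[point_id] = None
            continue
        # most_common(1) returns the (label, count) pair with the top count.
        labels[point_id] = Counter(votes).most_common(1)[0][0]
    return labels


votes = {0: ["car", "car", "road"], 1: ["road"], 2: []}
print(transfer_labels(votes))
```

A majority vote is only one way to realise "winner takes all"; weighting votes by viewing distance or label confidence would be an equally plausible variant.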


IV. SEMANTIC MOTION SEGMENTATION 

In this section, we deal with the first module of our system.
A sample result of our segmentation algorithm is shown
in Figure 1. With input images from a stereo camera, we
give an overview of how we perform semantic segmentation
[11] to first separate dynamic objects from the static scene.
We combine classical semantic segmentation with a new set
of motion constraints proposed in [17] to perform semantic
motion segmentation, which jointly optimizes for semantic and
motion segmentation. While we give an overview of the
formulation in this section, for brevity, the methodologies used
for training and testing, and the rationale behind using mean field
approximations, are outlined in [17].

We do joint estimation of motion and object labels by
exploiting the fact that they are interrelated. We formulate
the problem as a joint optimization problem of two parts,
object class segmentation and motion segmentation. We
define a dense CRF where the set of random variables
$Z = \{Z_1, Z_2, \ldots, Z_N\}$ corresponds to the set of all image pixels
$i \in \mathcal{V} = \{1, 2, \ldots, N\}$. Let $\mathcal{N}_i$ denote the neighbors of the
variable $Z_i$ in image space. Any possible assignment of labels
to the random variables will be called a labelling and denoted
by $z$. We define the energy of the joint CRF as

$$E^J(z) = \sum_{i \in \mathcal{V}} \psi^J_i(z_i) + \sum_{i \in \mathcal{V},\, j \in \mathcal{N}_i} \psi^J_{ij}(z_i, z_j) \qquad (1)$$

where $\psi^J_i$ is the joint unary potential and $\psi^J_{ij}$ represents
the joint pairwise potential. We describe these terms in brief
in the next two sections.


A. Joint Unary Potential: 

The joint unary potential $\psi^J_i$ is defined as an interactive
potential term which incorporates a relationship between the
object class and the corresponding motion likelihood for each
pixel. Each random variable $Z_i = [X_i, Y_i]$ takes a label $z_i =
[x_i, y_i]$ from the product space of object class and motion
labels. The combined unary potential of the joint CRF is

$$\psi^J_i([x_i, y_i]) = \psi^O_i(x_i) + \psi^M_i(y_i) + \psi^{OM}_i(x_i, y_i) \qquad (2)$$


The object class unary potential $\psi^O_i(x_i)$ describes the cost
of the pixel taking the corresponding label and is computed
using pre-trained models of color, texture and location
features for each object as in [2]. The new motion class
unary potential $\psi^M_i(y_i)$ is given by the motion likelihood
of the pixel and is computed as the difference between the
predicted and the measured optical flow. The measured flow
is computed using dense optical flow. The predicted flow
measures how much the object needs to move given its depth
in the current image and assuming it is a stationary object.
Objects deviating from the predicted flow are likely to be
dynamic objects. It is computed as

$$X' = K R K^{-1} X + K T / z \qquad (3)$$

where $K$ is the intrinsic camera matrix, $R$ and $T$ are the
rotation and translation of the camera respectively, and $z$
is the depth [17]. $X$ is the location of the pixel in image
coordinates and $X'$ is the predicted flow vector of the
pixel given from the motion of the camera. Thus the unary
potential is now computed as


where $\Sigma$ is the sum of the covariances of the predicted and
measured flows as shown in [15], and $\hat{X}' - X'$ represents the
difference between the predicted flow and the measured flow. The
object-motion unary potential $\psi^{OM}_i(x_i, y_i)$ incorporates the
object-motion class compatibility and can be expressed as

(5)

where the learnt correlation term between the motion and object
class labels takes values in $[-1, 1]$. This potential helps
incorporate the relationship between an object class and
its motion (for example, trees and roads are stationary, but
cars move). We use a piecewise method for training the label
and motion correlation matrices using the modified Adaboost
framework described in [17].
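
The predicted flow of equation (3) can be illustrated with a small sketch. It assumes a pinhole intrinsic matrix with focal lengths `fx`, `fy` and principal point `(cx, cy)`, so that $K^{-1}$ has a closed form, and it normalizes the homogeneous result before subtracting the original pixel location. All function and parameter names are hypothetical and chosen only for this example.

```python
import math


def mat_vec(m, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(m[r][c] * v[c] for c in range(3)) for r in range(3)]


def predicted_flow(u, v, depth, fx, fy, cx, cy, R, T):
    """Flow a static pixel (u, v) at the given depth would show under (R, T).

    Evaluates K R K^{-1} X + K T / z for X = (u, v, 1), written as
    K (R K^{-1} X + T / z), then dehomogenizes.
    """
    # K^{-1} X for a pinhole K: the viewing ray at unit depth.
    ray = [(u - cx) / fx, (v - cy) / fy, 1.0]
    rotated = mat_vec(R, ray)
    moved = [rotated[k] + T[k] / depth for k in range(3)]
    # Apply K and divide by the third homogeneous coordinate.
    u_new = (fx * moved[0] + cx * moved[2]) / moved[2]
    v_new = (fy * moved[1] + cy * moved[2]) / moved[2]
    return (u_new - u, v_new - v)


def flow_residual(predicted, measured):
    """Euclidean distance between predicted and measured flow vectors."""
    return math.hypot(predicted[0] - measured[0], predicted[1] - measured[1])


I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Camera translates 1 unit along x; a point 10 units away shifts by fx / 10 pixels.
print(predicted_flow(50.0, 40.0, 10.0, 100.0, 100.0, 50.0, 40.0, I3, [1.0, 0.0, 0.0]))
```

A pixel whose measured flow has a large `flow_residual` against this prediction is a candidate for the moving class; the plain Euclidean residual used here is a simplification and does not include the covariance weighting mentioned in the text.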


B. Joint Pairwise Potential: 

The joint pairwise potential $\psi^J_{ij}$ enforces the
consistency of object and motion class between the neighboring
pixels. We compute the joint pairwise potential as

$$\psi^J_{ij}([x_i, y_i], [x_j, y_j]) = \psi^O_{ij}(x_i, x_j) + \psi^M_{ij}(y_i, y_j) \qquad (6)$$

where we disregard the joint pairwise term over the product
space. The object class pairwise potential takes the form of
a Potts model

$$\psi^O_{ij}(x_i, x_j) = \begin{cases} 0 & \text{if } x_i = x_j \\ p(i,j) & \text{if } x_i \neq x_j \end{cases} \qquad (7)$$

where $p(i,j)$ is the standard pairwise potential
given in [12].

The motion class pairwise potential $\psi^M_{ij}(y_i, y_j)$ is given as
the relationship between neighboring pixels and encourages
adjacent pixels in the image to have similar motion labels.
The cost of the function is defined as

$$\psi^M_{ij}(y_i, y_j) = \begin{cases} 0 & \text{if } y_i = y_j \\ g(i,j) & \text{if } y_i \neq y_j \end{cases} \qquad (8)$$

where $g(i,j)$ is an edge feature based on the difference
between the flows of the neighboring pixels ($g(i,j) = |f(y_i) - f(y_j)|$),
and $f(\cdot)$ returns the flow of the corresponding pixel.
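
As a small illustration of the joint pairwise term, the sketch below evaluates a Potts penalty on the object labels plus a Potts penalty on the motion labels for one pair of neighboring pixels. The penalty values passed in for `p_ij` and `g_ij` are placeholders; in the formulation above they come from the appearance kernel of [12] and from the flow difference respectively.

```python
def potts(label_i, label_j, penalty):
    """Potts model: no cost for equal labels, a fixed penalty otherwise."""
    return 0.0 if label_i == label_j else penalty


def joint_pairwise(z_i, z_j, p_ij, g_ij):
    """Sum of object-class and motion-class Potts terms for one pixel pair.

    z_i and z_j are (object_label, motion_label) tuples. p_ij and g_ij are
    the object and motion penalties for this pair (placeholder values here).
    """
    (x_i, y_i), (x_j, y_j) = z_i, z_j
    return potts(x_i, x_j, p_ij) + potts(y_i, y_j, g_ij)


# Same object class, different motion class: only the motion penalty applies.
print(joint_pairwise(("car", "moving"), ("car", "static"), p_ij=2.0, g_ij=0.5))
```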


C. Inference and learning: 

We follow Krähenbühl et al. [12] to perform inference on
this dense CRF using a mean field approximation. In this
approach we try to find a mean field approximation $Q(z)$ that
minimizes the KL-divergence $D(Q \| P)$ among all the
distributions $Q$ that can be expressed as a product of independent
marginals, $Q(z) = \prod_i Q_i(z_i)$. We can further factorize $Q$ into
a product of marginals over the multi-class object layer and the binary
motion segmentation layer by taking $Q_i(z_i) = Q^O_i(x_i)\, Q^M_i(y_i)$.
Here $Q^O_i$ is a multi-class distribution over the object labels,
and $Q^M_i$ is a binary distribution over the moving and stationary
classes ($Q^M_i \in \{0, 1\}$). We compute inference separately for
both layers, i.e., the object class layer and the motion layer [17].


V. TRAJECTORY ESTIMATION 

The motion segmented images of the static world and
moving objects are then used as input to localize and map
each object independently. In this section, we propose a novel
framework for trajectory computation for static or moving
objects from a moving platform. The process below is carried
out for all the moving objects and the camera-mounted
vehicle¹. Let us introduce some preliminary notation for
trajectory computation. The extrinsic parameters for frame
$k = 1, 2, 3, 4, \ldots$ are the rotation matrix $R_k$ and the camera
center $C_k$ relative to a world coordinate system. Then the
translation vector between the world and the camera coordinate
systems is $T_k = -R_k C_k$.

Fig. 3: Reconstruction result for the KITTI 4 sequence. Note the
accurate reconstruction of the trajectories of the car and the camera.
Please see the supplementary video for further details.

a) Trajectory Initialization: We initialize the motion
of each object separately using SIFT feature points. SIFT
feature points are tracked using dense optical flow between
consecutive pairs of frames. Key points with valid depth
values are used in a 3-point algorithm within a RANSAC
framework to find a robust relative transformation between
pairs of frames. We obtain pose estimates of the moving
object in the world frame by chaining the relative
transformations together in succession.
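
Chaining relative transformations can be sketched as repeated composition of rigid motions. The example below is an illustration under an assumed convention: each relative transform `(R, t)` maps coordinates of frame k+1 into frame k, so the accumulated pose maps frame k into the first frame. The helper names are hypothetical.

```python
def mat_mul(a, b):
    """Product of two 3x3 matrices given as lists of rows."""
    return [[sum(a[r][k] * b[k][c] for k in range(3)) for c in range(3)]
            for r in range(3)]


def mat_vec(m, v):
    """Product of a 3x3 matrix and a 3-vector."""
    return [sum(m[r][c] * v[c] for c in range(3)) for r in range(3)]


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]


def compose(outer, inner):
    """Rigid motion equal to applying `inner` first and `outer` second."""
    r_o, t_o = outer
    r_i, t_i = inner
    rotated = mat_vec(r_o, t_i)
    return mat_mul(r_o, r_i), [rotated[k] + t_o[k] for k in range(3)]


def chain(relative_transforms):
    """Accumulate frame-to-frame transforms into poses w.r.t. the first frame."""
    poses = [(IDENTITY, [0.0, 0.0, 0.0])]
    for rel in relative_transforms:
        poses.append(compose(poses[-1], rel))
    return poses


step = (IDENTITY, [1.0, 0.0, 0.0])
print([t for _, t in chain([step, step])])
```

Because errors in each relative estimate accumulate through the chain, such an initialization is normally followed by a joint refinement, which is the role bundle adjustment plays later in this section.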

For moving objects the initial frame k where detection 
occurs is taken as the starting point. Trajectory estimates are 
then initialized for each object independently corresponding 
to the frame k assuming the camera is static. 

b) 3D Object Motion Estimation: Once 3D trajectories
are estimated for each object independently, we need
to map these trajectories onto the world coordinate system.
Since we are dealing with stereo data and have 3D information
for every frame, this mapping can be represented
as simple coordinate transformations. Also, since we are
not dealing with monocular images, the problem of relative
scaling can be avoided.

Given the pose of the real camera in the frame and the pose
of the virtual camera [7] computed during the
trajectory initialization described earlier, we can
compute the pose of the object relative to its
original position in the first frame in the world coordinate
system. The object rotation and translation are given
as

Rl = (TT - m (9) 

Thus we get the localization and sparse map of both the static
and the moving world. We found this approach to object motion
estimation to be better than VISO2 [16] on both short and long
sequences.

¹ Henceforth referred to as the camera.










A. Bundle Adjustment 

Once 3D object motion and structure initialization has 
been done, we need to refine the structure and motion using 
bundle adjustment (BA). In this section, we describe our 
framework for BA to refine the trajectory and sparse 3D 
point reconstruction of dynamic objects along with several 
novel constraints added to BA that increase the accuracy of 
our trajectories and 3D points. We term these constraints 
semantic or contextual constraints since they represent our 
understanding of the world in a geometric language, which 
we use to effectively optimize 3D points and trajectories in 
the presence of noise and outliers. The assumptions underlying
these constraints derive from commonly observed shape
and motion traits of cars in urban scenarios. For example,
the normal constraints follow the logic that the motion of 
a dynamic object like a vehicle is always on a plane (the 
road surface) and hence constrained by its normal. Similarly, 
the 3D points on a dynamic object are constrained to lie 
within a 3D “box” since dynamic objects like cars cannot be 
infinitely large. Finally, our trajectory constraints encode the 
fact that dynamic objects have smooth trajectories, which is 
often true in urban scenarios. In summary, we try to minimize 
the following objective function 

$$\sum_{i} \sum_{p \in V(i)} \mathrm{BA2D} + \lambda\,\mathrm{BA3D} + \lambda\,\mathrm{TC} + \mathrm{NC} + \mathrm{BC} \qquad (10)$$

where BA2D represents the 2D BA reprojection error ($\|\tilde{x}_p -
K[R_i \mid T_i] X_p\|^2$), BA3D represents the 3D registration
error common in optimization over stereo images ($\|\tilde{X}_p -
[R_i \mid T_i] X_p\|^2$), and TC, NC, BC represent various optimization
terms that can be seen as imposed constraints on the resulting
shape and trajectories, as explained below. Here $i$ indexes into
images, and ~ marks variables in the camera coordinate
system, with other quantities being expressed in the world
coordinate system. Also, $p \in V(i)$ represents pixels visible
in image $i$.

1) Planar Constraint: We constrain the motion to be
perpendicular to the ground plane normal, where the normal
is found from the initial 3D reconstruction of the ground.

NC1: $N_g^{\top} t_c$ (11)

where $N_g$ is the normal of the ground plane and $t_c$ is the direction
of translation in the world coordinate system. Since the 3D
reconstruction of the ground can be noisy, estimation of $N_g$
is done using least squares. Alternatively, we could follow
a RANSAC based framework of selecting the $m$ top hypotheses
for the normal $N_g^i$ ($i = 1 \ldots m$), and allow bundle adjustment
to minimize an average error of the form

NC2: $\sum_{i=1}^{m} (N_g^{i})^{\top} t_c$ (12)
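
The planar constraint can be illustrated by measuring how much of a translation direction points along the ground normal. The sketch below normalizes both vectors and returns their dot product, which is zero for motion within the ground plane. Averaging absolute values over several normal hypotheses, as in `averaged_planar_residual`, is an assumption made for this example; the exact form of the averaged term is not spelled out here.

```python
import math


def planar_residual(normal, translation):
    """Cosine of the angle between the ground normal and the translation.

    Returns 0.0 when the translation lies in the ground plane and +/-1.0
    when it points straight along the normal. Zero-length inputs give 0.0.
    """
    n_len = math.sqrt(sum(c * c for c in normal))
    t_len = math.sqrt(sum(c * c for c in translation))
    if n_len == 0.0 or t_len == 0.0:
        return 0.0
    dot = sum(n * t for n, t in zip(normal, translation))
    return dot / (n_len * t_len)


def averaged_planar_residual(normals, translation):
    """Mean absolute residual over several normal hypotheses (assumed form)."""
    return sum(abs(planar_residual(n, translation)) for n in normals) / len(normals)


ground_normal = (0.0, 1.0, 0.0)
print(planar_residual(ground_normal, (1.0, 0.0, 0.0)))  # motion along the road
print(planar_residual(ground_normal, (0.0, 2.0, 0.0)))  # motion off the road
```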

2) Smooth Trajectory Constraints: We enforce smoothness
in the trajectory, a valid assumption for urban scenes, by
constraining camera translations in consecutive frames as

TC1: $\|(T_k - T_{k-1}) \times T_k\|$ (13)

where $T_k$ and $T_{k-1}$ are the 3D translations at frames $k$ and $k-1$.
Alternatively, we could also minimize the norm between
consecutive translations, unlike TC1, which only penalizes
direction deviations in translation.

TC2: $\|T_{k+1} - 2 T_k + T_{k-1}\|^2$ (14)
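
The two smoothness terms can be compared on a toy trajectory. The sketch assumes one reading of the terms: TC1 as the norm of the cross product $(T_k - T_{k-1}) \times T_k$, which vanishes when consecutive translations are parallel, and TC2 as the squared norm of the second difference $T_{k+1} - 2T_k + T_{k-1}$, which also penalizes changes in magnitude. The function names are hypothetical.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def tc1(t_prev, t_cur):
    """Direction-only term: norm of (T_k - T_{k-1}) x T_k (assumed form)."""
    diff = tuple(c - p for c, p in zip(t_cur, t_prev))
    return math.sqrt(sum(v * v for v in cross(diff, t_cur)))


def tc2(t_prev, t_cur, t_next):
    """Squared norm of the second difference T_{k+1} - 2 T_k + T_{k-1}."""
    return sum((n - 2.0 * c + p) ** 2 for p, c, n in zip(t_prev, t_cur, t_next))


# Parallel translations of different length: TC1 is blind to the change,
# while TC2 reacts to the change in magnitude.
print(tc1((1.0, 0.0, 0.0), (2.0, 0.0, 0.0)))
print(tc2((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0)))
```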

3) Box Constraints: Depth estimation of objects like
cars is generally noisy because their surface is not typically
Lambertian in nature, and hence violates the basic
assumptions of brightness constancy across time and viewing
angle. Furthermore, noise in depth infuses errors into
the estimated trajectory through the trajectory initialization
component. To improve the reconstruction accuracy in such
cases, and to limit the destructive effect that noisy depth
has on object trajectories, we introduce shape priors into the
BA cost function that essentially constrain all the 3D points
belonging to a moving object to remain within a "box".
More specifically, let $X_i^O$ and $X_j^O$ be two 3D points on a moving
object $O$. For every such pair of points on the object, we
define the following constraint

BC1: $\sum_{\forall (i,j)} \|X_i^O - X_j^O - B(i,j)\|^2, \quad -\delta \le B(i,j) \le \delta$ (15)

where $B(i,j)$ is a vector of bounds with individual
components $(b_x(i,j), b_y(i,j), b_z(i,j))$ and $\delta$ is a vector of positive
values.

Note that the above equation is defined for every pair of
points on the object, which leads to a quadratic explosion of
terms since $B(i,j)$ is a separate variable for each pair.

a) Alternate Formulations: One way to reduce the
explosion would be to reduce the number of variables added
to BA because of the box constraints. This could be done by
minimizing one of the following terms instead of the
constraint in equation (15):

BC2: $\sum_{\forall (i,j)} \|X_i^O - X_j^O - b(i,j)\|^2, \quad -\delta \le b(i,j) \le \delta$ (16)

BC3: $\sum_{\forall (i,j)} \|X_i^O - X_j^O - B\|^2, \quad -\delta \le B \le \delta$ (17)

BC4: $\sum_{\forall (i,j)} \|X_i^O - X_j^O - b\|^2, \quad -\delta \le b \le \delta$ (18)

where $b(i,j)$ in equation (16) is a scalar common to all 3
dimensions, $B$ (equation (17)) is a $3 \times 1$ vector common to
all point pairs, and $b$ (equation (18)) is a scalar common to
all pairs and dimensions.

b) Alternate Minimization Strategies: It is
now known that a lot of the information in terms like
BC1, BC2, BC3 and BC4 above is redundant in nature [20],
and there is essentially a small "subset" of pairs which
is sufficient to produce optimal or near-optimal results in
such cases. However, it is not clear how to pick this small
subset. Here, we take the help of the Johnson-Lindenstrauss
theorem and its variants [22], [23], [24] to select a random
set of pairs from the ones available, such that we closely
approximate the BC error obtained when all the point pairs are used.





More specifically, the terms expressed in
BC1, BC2, BC3 and BC4 can all be expressed in the form

BCLin: $\|AX - B\|^2$ (19)

where $X$ is a concatenation of all 3D points, and $B$ is a
collection of all car bounds. The matrix $A$ is constructed in
such a way that each row of $A$ consists of only two non-zero
elements at the $i$-th and $j$-th positions, with values 1 and -1
respectively, and they represent the difference $X_i^O - X_j^O$. Note
that the dimensions of $A$ are of the order $3\binom{n}{2} \times 3n$, where $n$
is the number of 3D points. Notice that for $n = 3000$, $\binom{n}{2}$ is
approximately 4.5 million, which makes the problem very slow to optimize.
To reduce this computational burden, we embed the above
optimization problem in a randomly selected subspace of
considerably lower dimension, with the guarantee that the
solution obtained in the subspace is close to the original
problem solution with high probability. To do this, we draw
upon a slightly modified version of the affine embedding
theorem presented in [24], which states

Theorem 5.1: For any minimization of the form $\|AX -
B\|$, where $A$ is of size $m \times n$ and $m \gg n$, there exists a
subspace embedding matrix $S : \mathbb{R}^m \to \mathbb{R}^t$, where $t = \mathrm{poly}(n/\epsilon)$,
such that

$$\|SAX - SB\|_2 = (1 \pm \epsilon)\,\|AX - B\|_2 \qquad (20)$$

Moreover, the matrix $S$ of size $t \times m$ is designed such that
each column of $S$ has only 1 non-zero element at a randomly
chosen location, with value 1 or -1 with equal probability.
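
A sparse embedding of the kind described in the theorem can be applied without ever forming $S$: each row of the stacked system is added, with a random sign, to one randomly chosen output row. The sketch below illustrates that mechanism only; it is not the solver used in this work, and the helper names `make_sketch` and `apply_sketch` are hypothetical.

```python
import random


def make_sketch(m, t, seed=0):
    """Describe a t x m sparse embedding by one (row, sign) pair per column."""
    rng = random.Random(seed)
    return [(rng.randrange(t), rng.choice((-1.0, 1.0))) for _ in range(m)]


def apply_sketch(sketch, rows, t):
    """Compute S @ rows, where `rows` is a list of m equal-length rows."""
    width = len(rows[0])
    out = [[0.0] * width for _ in range(t)]
    for (bucket, sign), row in zip(sketch, rows):
        for c, value in enumerate(row):
            # Column k of S has a single non-zero entry `sign` in row `bucket`.
            out[bucket][c] += sign * value
    return out


# Hand-written sketch: rows 0 and 1 collide in bucket 0 with opposite signs.
manual = [(0, 1.0), (0, -1.0), (1, 1.0)]
print(apply_sketch(manual, [[1.0], [1.0], [2.0]], t=2))
```

The same sketch must be applied to both $A$ and $B$ so that the reduced problem $\|SAX - SB\|$ stays consistent with the original one.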

Note that since the elements of $S$ are randomly assigned 1 or
-1, the above transformation cannot be exactly interpreted
as a random sampling of pairs of points. However, for the
sake of implementation simplicity, we "relax" $S$ to a random
selection matrix. As we show later, empirically we get very
satisfying results.

There can be several strategies to select random pairs 
of points for box constraints. We experimented with the 
following in this paper. 


• Strat1: Randomly select pairs from the available set.

• Strat2: Randomly select one point, and create its pair
with the 3D point that is farthest from the selected point
in terms of Euclidean distance.

• Strat3: Randomly select one point, and sort the other
points in descending order based on Euclidean distance
to the selected point. Pick the first point from the list that
has not been part of any pair before.
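
The three sampling strategies can be sketched as follows. This is an illustrative reading of the descriptions above rather than the original code: points are tuples of coordinates, pairs are returned as index tuples, and in the third strategy both members of a chosen pair are marked as used, which is an assumption about what "part of any pair" covers.

```python
import math
import random


def strat1(points, count, rng):
    """Pick `count` pairs of distinct point indices uniformly at random."""
    return [tuple(rng.sample(range(len(points)), 2)) for _ in range(count)]


def strat2(points, count, rng):
    """Pair a random point with the point farthest away from it."""
    pairs = []
    for _ in range(count):
        i = rng.randrange(len(points))
        others = (k for k in range(len(points)) if k != i)
        j = max(others, key=lambda k: math.dist(points[i], points[k]))
        pairs.append((i, j))
    return pairs


def strat3(points, count, rng):
    """Pair a random point with its farthest point not used in any pair yet."""
    used, pairs = set(), []
    for _ in range(count):
        i = rng.randrange(len(points))
        order = sorted((k for k in range(len(points)) if k != i),
                       key=lambda k: math.dist(points[i], points[k]),
                       reverse=True)
        j = next((k for k in order if k not in used), None)
        if j is None:  # every other point already belongs to a pair
            break
        used.update((i, j))
        pairs.append((i, j))
    return pairs


pts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0), (6.0, 0.0, 0.0)]
print(strat2(pts, 3, random.Random(0)))
```

Sorting all points for every draw, as `strat3` does here, costs O(n log n) per pair; a spatial index would be a natural replacement for large point sets.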


Once the proper set of constraints is selected from the
above choices, the final objective function in equation (10) is
minimized in the L2 norm using the Ceres solver [13].


VI. EXPERIMENTAL RESULTS 

In this section we provide an extensive evaluation of our
algorithms on both synthetic and real data. For real data,
we use the KITTI tracking dataset for evaluation of
the algorithm, as ground truth for the localization of moving
objects per camera frame is available. It consists of several
sequences collected by a perspective car-mounted camera
driving in urban, residential and highway environments,
making it a varied and challenging real world dataset. We have
taken four sequences consisting of 30, 212, 30 and 100 images
for evaluating our algorithm. We choose these 4 sequences
as they pose serious challenges to the motion segmentation
algorithm, since the moving cars lie in the same subspace as
the camera. These sequences also have a mix of cars
visible for a short duration and cars visible for the
entire sequence, which allows us to test the robustness of our
localization and reconstruction algorithms on both short and
long sequences.

Fig. 4: Comparison of trajectory errors of our algorithm to VISO2
[16] and standard BA. The histogram plots RMSE magnitude on
the x axis, and the number of pose measurements in the trajectory that
fall within a particular range on the y axis. Note that most of our
errors are concentrated on the left (low error), while those of VISO2 [16]
and BA are more evenly spread. The total summed error is: 2D-BA
1.7919, VISO2 2.6185, our approach 1.5429.

We do an extensive quantitative evaluation on a synthetic
dataset. We generated 1000 3D points on a cube attached
to a planar ground to simulate a car and road. We then
move the car over the road, while simultaneously moving
the camera, to generate moving images after projection of
the 3D points. Finally, we add Gaussian noise to both the
3D points on the car and the points on the road to simulate
errors in measurement. Correspondences between frames are
automatically known as a result of our dataset design.
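
A synthetic scene of this kind can be generated in a few lines. The sketch below is an assumption-laden illustration: it samples points on the faces of an axis-aligned cube resting on the plane y = 0, moves them rigidly, and perturbs them with zero-mean Gaussian noise. The cube size, noise level, and function names are all hypothetical and not taken from the experiments.

```python
import random


def cube_surface_points(count, side, rng):
    """Sample points on the faces of the cube [0, side]^3 sitting on y = 0."""
    points = []
    for _ in range(count):
        p = [rng.uniform(0.0, side) for _ in range(3)]
        # Snap one random coordinate to a face of the cube.
        p[rng.randrange(3)] = rng.choice((0.0, side))
        points.append(tuple(p))
    return points


def translate(points, offset):
    """Move every point by the same offset (the simulated car motion)."""
    return [tuple(c + o for c, o in zip(p, offset)) for p in points]


def add_gaussian_noise(points, sigma, rng):
    """Perturb each coordinate with zero-mean Gaussian noise."""
    return [tuple(c + rng.gauss(0.0, sigma) for c in p) for p in points]


rng = random.Random(0)
car = cube_surface_points(1000, side=2.0, rng=rng)
# Two frames: the car advances 0.5 units along x between them.
frames = [add_gaussian_noise(translate(car, (0.5 * k, 0.0, 0.0)), 0.05, rng)
          for k in range(2)]
print(len(frames), len(frames[0]))
```

Because every frame is derived from the same point list, the index of a point serves as its correspondence across frames.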


A. Quantitative Evaluation of BA 

In this section, we do an extensive evaluation of the
different terms proposed in Section V. Note that we tried all
the different terms and strategies proposed here on real data
as well, and in all cases the conclusions derived from the synthetic
data experiments are consistent with those from real data.

1) Evaluating Terms and Strategies: In the following 
section we present the results for evaluation of various terms 
and strategies. 

a) Normal Constraint: This constraint is a contextual
constraint in the sense that it enforces the fact that the
moving object is usually attached to a planar ground in
urban settings, and so any deviation of the object trajectory
along the direction of the normal of the ground plane should
be penalized. While NC1 computes a least-squares estimate
of the normal, which is optimal under Gaussian noise,
NC2 computes several normal hypotheses using a RANSAC
framework. Figure (6a) shows the results comparing the two
terms. We find that NC1 normally performs better.





















TABLE I: Static scene of the KITTI dataset. The dataset has 212
frames. Note that adding Motion Segmentation (MS) drastically
improves results.

Error Type       Without MS   MS         MS+Normal   MS+Normal+Trajectory
2D     RMSE      1.416246     1.001566   0.941971    0.958505
2D     mean      1.212164     0.826189   0.764188    0.779054
2D     median    1.088891     0.677419   0.690825    0.716546
3D     RMSE      1.476649     0.959499   0.975747    0.978197
3D     mean      1.272985     0.786729   0.822169    0.824090
3D     median    1.279508     0.712513   0.773672    0.769680
2D+3D  RMSE      1.472399     0.958505   0.958541    0.958541
2D+3D  mean      1.269541     0.779054   0.779132    0.779132
2D+3D  median    1.269238     0.716546   0.716967    0.716967

TABLE II: Dynamic scene of the KITTI dataset, with 212 frames. Note
that adding box constraints over normal and trajectory constraints
leads to the best results.

Error Type       MS         MS+Normal   MS+Normal+Trajectory   MS+Normal+Trajectory+Box (1000 constr)
2D     RMSE      2.425649   2.362224    2.351205               2.302849
2D     mean      1.989408   1.955466    1.969793               1.937154
2D     median    1.669304   1.616398    1.685272               1.640389
3D     RMSE      3.627977   3.587194    3.352087               3.270264
3D     mean      2.544718   2.527314    2.398578               2.367702
3D     median    2.000463   1.997689    1.941246               1.928450
2D+3D  RMSE      2.357187   2.305733    2.296139               2.254192
2D+3D  mean      2.035764   1.986784    1.971698               1.881728
2D+3D  median    1.877257   1.759010    1.760857               1.756554



Fig. 5: Synthetic results for box constraints. Note that in the two 
experiments we added a large amount of noise and picked 1000 
constraints from around 500000 pairs of points. While there isn’t 
much difference between the terms in (a) as such, StratS performs 
better than others in (b). 


b) Trajectory Constraint: The trajectory constraint 
enforces smoothness in moving object trajectories, either 
by requiring that the direction of motion does not change 
significantly between consecutive frames (TC1) or by 
constraining both direction and magnitude (TC2). 
Figure (6b) plots comparative results, from which we infer 
that TC2 performs better. 
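
To make the difference between the two smoothness terms concrete, here is a small sketch. The residual definitions are assumptions chosen for illustration (one minus the cosine between consecutive displacements for the direction-only term, and the norm of the displacement change for the direction-plus-magnitude term); the paper's exact formulation may differ.

```python
import math

def displacements(traj):
    """Frame-to-frame motion vectors of the object centre."""
    return [tuple(b - a for a, b in zip(p, q)) for p, q in zip(traj, traj[1:])]

def direction_residuals(traj):
    """Direction-only term: 1 - cos(angle) between consecutive displacements."""
    v = displacements(traj)
    res = []
    for a, b in zip(v, v[1:]):
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        if na == 0.0 or nb == 0.0:
            res.append(0.0)  # direction undefined for a stationary step
            continue
        res.append(1.0 - sum(x * y for x, y in zip(a, b)) / (na * nb))
    return res

def direction_magnitude_residuals(traj):
    """Direction-and-magnitude term: norm of the change in displacement."""
    v = displacements(traj)
    return [math.sqrt(sum((y - x) ** 2 for x, y in zip(a, b)))
            for a, b in zip(v, v[1:])]
```

For an object that keeps its heading but doubles its speed, the direction-only residual stays at zero while the second residual is non-zero, which shows how the second form is the stricter constraint of the two.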

c) Box Constraint: Box constraints enforce that the 
3D reconstruction of the moving object under consideration 
must be compact. This is a useful constraint, since gross 
errors in the depth of the object as estimated by the stereo 
algorithm [19] are normally not corrected by BA, which 
settles into a local minimum. Thus, to “focus” the BA towards 
better optimizing the 3D structure, we add these constraints. 
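
One plausible way to express compactness as a pairwise cost is sketched below. It assumes a hinge penalty on point pairs that lie further apart than a maximum object extent; both the penalty form and the `max_extent` parameter are illustrative assumptions rather than the term used in the paper.

```python
import math

def box_residual(p, q, max_extent):
    """Zero while two object points fit inside the box, else the excess distance."""
    return max(0.0, math.dist(p, q) - max_extent)

def box_cost(points, pairs, max_extent):
    """Sum of squared box residuals over the chosen index pairs."""
    return sum(box_residual(points[i], points[j], max_extent) ** 2
               for i, j in pairs)
```

A point with a grossly wrong depth ends up far from the rest of the object, so only pairs involving that point contribute to the cost, which is the intended "focusing" effect.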

d) Box Sampling Strategies: Since box constraints 
lead to an explosion of terms added to BA, we experiment 
with 4 strategies to reduce this computational burden by 
random sampling [22]. Figure 5 shows results for the various 
box-constraint terms and the various sampling strategies. 
Normally we find that BC1 along with StratS performs best. 
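
The simplest way to cut down the number of pairwise terms is uniform random sampling of point pairs, sketched below. This is only a baseline illustration and is not claimed to match any of the four strategies evaluated; `sample_pairs` is a hypothetical helper, and the point count of 1000 is an assumption chosen because it yields 499500 pairs, the same order as the roughly 500000 pairs mentioned in the caption of Fig. 5.

```python
import random

def sample_pairs(n_points, n_pairs, seed=0):
    """Draw n_pairs distinct unordered index pairs (i < j) uniformly at random."""
    total = n_points * (n_points - 1) // 2
    if n_pairs > total:
        raise ValueError("more pairs requested than exist")
    rng = random.Random(seed)
    pairs = set()
    while len(pairs) < n_pairs:
        i, j = rng.sample(range(n_points), 2)  # two distinct indices
        pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)

# 1000 constraints out of all pairs among 1000 points.
constraints = sample_pairs(1000, 1000)
```

Rejection sampling into a set avoids enumerating every pair, which matters when the full pair list is hundreds of times larger than the sample.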

B. Trajectory Evaluation 

Fig. 6: Synthetic results for Normal and trajectory constraints. 

We compare the estimated trajectories of the moving 
objects to the extended Kalman filter based object tracking 
VISO2 (Stereo) [16]. VISO2-S (Stereo) has a reported error 
of 2.44 % on the KITTI odometry dataset, making it a good 
baseline algorithm to compare with. As proposed by Sturm 
et al. [10], the comparison methodology is based on the 
absolute trajectory error (ATE), reported as root mean 
square error (RMSE), mean, and median. We use their 
evaluation algorithm, which aligns the two trajectories using 
SVD. We show the three statistics as mean and median are 
robust to outliers while RMSE shows the exact deviation 
from the ground truth. 
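
The three ATE statistics can be computed as in the sketch below. It assumes the estimated and ground-truth trajectories are already expressed in the same frame and matched pose by pose, so the SVD-based alignment step of the evaluation algorithm is deliberately left out; `ate_statistics` is a hypothetical helper name.

```python
import math
import statistics

def ate_statistics(estimated, ground_truth):
    """RMSE, mean and median of per-pose position errors (trajectories pre-aligned)."""
    errors = [math.dist(e, g) for e, g in zip(estimated, ground_truth)]
    rmse = math.sqrt(sum(err * err for err in errors) / len(errors))
    return {
        "rmse": rmse,
        "mean": statistics.mean(errors),
        "median": statistics.median(errors),
    }

# Four perfect poses and one pose that is 5 units off.
gt = [(float(t), 0.0, 0.0) for t in range(5)]
est = gt[:4] + [(4.0, 3.0, 4.0)]
stats = ate_statistics(est, gt)
```

In this toy case the median is 0, the mean is 1, and the RMSE is about 2.24, so a single bad pose moves the RMSE the most and the median the least.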

Table I reports the trajectory error of the odometry on 
the KITTI 1 sequence; it shows the improvement in the ATE 
of the odometry with motion segmentation compared to 
without it. Similarly, Table II reports the trajectory error 
for a moving object visible in all 212 images of the 
sequence. This table shows how the semantic constraints 
imposed on the motion of the moving object improve its 
trajectory, and how each constraint complements the 
trajectory computation and reconstruction of the dynamic 
objects. 

For quantitative evaluation of our method, we have computed 
the trajectories of all the moving objects. These 
trajectories are compared to their respective ground truth, and 
the absolute position error of each pose is computed. We have 
done a histogram based evaluation of all the position errors, 
as depicted in the accompanying figure, where we compare the 
trajectories of our algorithm with VISO2. We have evaluated 
the algorithm on a total of 297 poses of moving objects and 
found that our approach outperforms VISO2 and standard 2D 
bundle adjustment. The trajectories and reconstructions of 
some of the moving objects are depicted in Fig. 7. 
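
A histogram of per-pose position errors of the kind described above can be built as in this minimal sketch. The bin width and the `error_histogram` name are assumptions, and no values from the paper are reproduced.

```python
from collections import Counter

def error_histogram(errors, bin_width=0.5):
    """Map each non-empty bin (lower edge, upper edge) to its pose count."""
    counts = Counter(int(e // bin_width) for e in errors)
    return {(k * bin_width, (k + 1) * bin_width): counts[k]
            for k in sorted(counts)}
```

Computing one such histogram per method over the same set of poses makes it easy to see which method concentrates more poses in the low-error bins.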



























































Fig. 7 (KITTI 1): We show the (INPUT) image sequences for which we compute the semantic motion segmentation (SMS). We have depicted 
the reconstruction of moving objects with their trajectories (3D-REC). Blue trajectories represent the camera capturing the scene. All 
segmentation color labels are consistent with Figure (best viewed in color) 


VII. CONCLUSION 

In this paper, we have proposed a joint labelling framework 
for semantic motion segmentation and reconstruction 
in dynamic urban environments. We modelled the problem 
of creating a semantic dense map of moving objects in an 
urban environment using trajectory optimization. The 
experiments suggest that semantic segmentation provides good 
initial estimates to aid a generalized bundle adjustment based 
approach. This helps in improving the localization of the 
moving objects and creates an accurate semantic map. 